optimality equation
Finding Probably Approximate Optimal Solutions by Training to Estimate the Optimal Values of Subproblems
Megiddo, Nimrod, Wasserkrug, Segev, Davidovich, Orit, Shtern, Shimrit
The paper is about developing a solver for maximizing a real-valued function of binary variables. The solver relies on an algorithm that estimates the optimal objective-function value of instances from the underlying distribution of objectives and their respective sub-instances. The training of the estimator is based on an inequality that facilitates the use of the expected total deviation from optimality conditions as a loss function rather than the objective-function itself. Thus, it does not calculate values of policies, nor does it rely on solved instances.
On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes
Wan, Yi, Yu, Huizhen, Sutton, Richard S.
This paper analyzes reinforcement learning (RL) algorithms for Markov decision processes (MDPs) under the average-reward criterion. We focus on Q-learning algorithms based on relative value iteration (RVI), which are model-free stochastic analogues of the classical RVI method for average-reward MDPs. These algorithms have low per-iteration complexity, making them well-suited for large state space problems. We extend the almost-sure convergence analysis of RVI Q-learning algorithms developed by Abounadi, Bertsekas, and Borkar (2001) from unichain to weakly communicating MDPs. This extension is important both practically and theoretically: weakly communicating MDPs cover a much broader range of applications compared to unichain MDPs, and their optimality equations have a richer solution structure (with multiple degrees of freedom), introducing additional complexity in proving algorithmic convergence. We also characterize the sets to which RVI Q-learning algorithms converge, showing that they are compact, connected, potentially nonconvex, and comprised of solutions to the average-reward optimality equation, with exactly one less degree of freedom than the general solution set of this equation. Furthermore, we extend our analysis to two RVI-based hierarchical average-reward RL algorithms using the options framework, proving their almost-sure convergence and characterizing their sets of convergence under the assumption that the underlying semi-Markov decision process is weakly communicating.
Deterministic Policies for Constrained Reinforcement Learning in Polynomial-Time
Constrained Reinforcement Learning (CRL) traditionally produces stochastic, expectationconstrained policies that can behave undesirably - imagine a self-driving car that randomly changes lanes or runs out of fuel. However, artificial decision-making systems must be predictable, trustworthy, and robust. One approach to ensuring these qualities is to focus on deterministic policies, which are inherently predictable and trustworthy. Moreover, they are easy to implement [10], reliable for autonomous vehicles [16, 12], and effective for multi-agent coordination [23]. Similarly, almost sure and anytime constraints [21] provide inherent trustworthiness and robustness, essential for applications in medicine [6, 22, 18], disaster relief [9, 29, 27], and resource management [20, 19, 24, 4]. Despite the advantages of deterministic policies and stricter constraints, their computation remains an open challenge in CRL. Our research aims to address this challenge by studying the computational complexity of computing deterministic policies for a wide range of constraint types. Consider a constrained Markov Decision Process (cMDP) denoted by M. Let C represent an arbitrary cost criterion and B be the available budget.
Robustness and risk-sensitivity in Markov decision processes
We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function.
The In-Sample Softmax for Offline Reinforcement Learning
Xiao, Chenjun, Wang, Han, Pan, Yangchen, White, Adam, White, Martha
Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy. An emerging issue in this offline RL setting, however, is that the bootstrapping update underlying many of our methods suffers from insufficient action-coverage: standard max operator may select a maximal action that has not been seen in the dataset. Bootstrapping from these inaccurate values can lead to overestimation and even divergence. There are a growing number of methods that attempt to approximate an in-sample max, that only uses actions well-covered by the dataset. We highlight a simple fact: it is more straightforward to approximate an in-sample softmax using only actions in the dataset. We show that policy iteration based on the in-sample softmax converges, and that for decreasing temperatures it approaches the in-sample max. We derive an In-Sample Actor-Critic (AC), using this in-sample softmax, and show that it is consistently better or comparable to existing offline RL methods, and is also wellsuited to fine-tuning. We release the code at github.com/hwang-ua/inac A common goal in reinforcement learning (RL) is to learn a control policy from data. In the offline setting, the agent has access to a batch of previously collected data. This data could have been gathered under a near-optimal behavior policy, from a mediocre policy, or a mixture of different policies (perhaps produced by several human operators). A key challenge is to be robust to this data gathering distribution, since we often do not have control over data collection in some application settings.
A Data-Driven Model-Reference Adaptive Control Approach Based on Reinforcement Learning
Abouheaf, Mohammed, Gueaieb, Wail, Spinello, Davide, Al-Sharhan, Salah
Model-reference adaptive systems refer to a consortium of techniques that guide plants to track desired reference trajectories. Approaches based on theories like Lyapunov, sliding surfaces, and backstepping are typically employed to advise adaptive control strategies. The resulting solutions are often challenged by the complexity of the reference model and those of the derived control strategies. Additionally, the explicit dependence of the control strategies on the process dynamics and reference dynamical models may contribute in degrading their efficiency in the face of uncertain or unknown dynamics. A model-reference adaptive solution is developed here for autonomous systems where it solves the Hamilton-Jacobi-Bellman equation of an error-based structure. The proposed approach describes the process with an integral temporal difference equation and solves it using an integral reinforcement learning mechanism. This is done in real-time without knowing or employing the dynamics of either the process or reference model in the control strategies. A class of aircraft is adopted to validate the proposed technique.
On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly Communicating MDPs
We show two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik, & Sutton 2021a) and RVI Q-learning (Abounadi Bertsekas & Borkar 2001), converge in weakly communicating MDPs. Weakly communicating MDPs are the most general MDPs that can be solved by a learning algorithm with a single stream of experience. The original convergence proofs of the two algorithms require that the solution set of the average-reward optimality equation only has one degree of freedom, which is not necessarily true for weakly communicating MDPs. To the best of our knowledge, our results are the first showing average-reward off-policy control algorithms converge in weakly communicating MDPs. As a direct extension, we show that average-reward options algorithms for temporal abstraction introduced by Wan, Naik, & Sutton (2021b) converge if the Semi-MDP induced by options is weakly communicating.